word space
Constructing Cross-lingual Consumer Health Vocabulary with Word-Embedding from Comparable User Generated Content
Chang, Chia-Hsuan, Wang, Lei, Yang, Christopher C.
The online health community (OHC) is the primary channel for laypeople to share health information. To analyze the health consumer-generated content (HCGC) from the OHCs, identifying the colloquial medical expressions used by laypeople is a critical challenge. The open-access and collaborative consumer health vocabulary (OAC CHV) is the controlled vocabulary for addressing such a challenge. Nevertheless, OAC CHV is only available in English, limiting its applicability to other languages. This research proposes a cross-lingual automatic term recognition framework for extending the English CHV into a cross-lingual one. Our framework requires an English HCGC corpus and a non-English (i.e., Chinese in this study) HCGC corpus as inputs. Two monolingual word vector spaces are determined using the skip-gram algorithm so that each space encodes common word associations from laypeople within a language. Based on the isometry assumption, the framework aligns two monolingual spaces into a bilingual word vector space, where we employ cosine similarity as a metric for identifying semantically similar words across languages. The experimental results demonstrate that our framework outperforms the other two large language models in identifying CHV across languages. Our framework only requires raw HCGC corpora and a limited size of medical translations, reducing human efforts in compiling cross-lingual CHV.
- Europe > Italy > Tuscany > Florence (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Nevada (0.04)
- (7 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
Word Space
Representations for semantic information about words are neces(cid:173) sary for many applications of neural networks in natural language processing. This paper describes an efficient, corpus-based method for inducing distributed semantic representations for a large num(cid:173) ber of words (50,000) from lexical coccurrence statistics by means of a large-scale linear regression. The representations are success(cid:173) fully applied to word sense disambiguation using a nearest neighbor method .
Word Embeddings and Document Vectors: Part 1. Similarity
This similarity can be as simple as a categorical feature value such as the color or shape of the objects we are classifying, or a more complex function of all categorical and/or continuous feature values that these objects possess. Documents can be classified as well using their quantifiable attributes such as size, file extension etc… Easy! But unfortunately it is the meaning/import of the text contained in the document is what we are usually interested in for classification. The ingredients of text are words (and throw in punctuation as well) and the meaning of a text snippet is not a deterministic function of these constituents. We know that the same set of words but in a different order, or simply with different punctuation can convey different meanings.
Finding Alternate Features in Lasso
Hara, Satoshi, Maehara, Takanori
We propose a method for finding alternate features missing in the Lasso optimal solution. In ordinary Lasso problem, one global optimum is obtained and the resulting features are interpreted as task-relevant features. However, this can overlook possibly relevant features not selected by the Lasso. With the proposed method, we can provide not only the Lasso optimal solution but also possible alternate features to the Lasso solution. We show that such alternate features can be computed efficiently by avoiding redundant computations. We also demonstrate how the proposed method works in the 20 newsgroup data, which shows that reasonable features are found as alternate features.
- North America > United States (0.14)
- Asia > Japan > Honshū > Chūbu > Shizuoka Prefecture > Shizuoka (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Visual Contextual Advertising: Bringing Textual Advertisements to Images
Chen, Yuqiang (Shanghai Jiao Tong University) | Jin, Ou (Shanghai Jiao Tong University) | Xue, Gui-Rong (Shanghai Jiao Tong University) | Chen, Jia (Shanghai Jiao Tong University) | Yang, Qiang (Hong Kong University of Science and Technology)
Advertising in the case of textual Web pages has been studied extensively by many researchers. However, with the increasing amount of multimedia data such as image, audio and video on the Web, the need for recommending advertisement for the multimedia data is becoming a reality. In this paper, we address the novel problem of visual contextual advertising, which is to directly advertise when users are viewing images which do not have any surrounding text. A key challenging issue of visual contextual advertising is that images and advertisements are usually represented in image space and word space respectively, which are quite different with each other inherently. As a result, existing methods for Web page advertising are inapplicable since they represent both Web pages and advertisement in the same word space. In order to solve the problem, we propose to exploit the social Web to link these two feature spaces together. In particular, we present a unified generative model to integrate advertisements, words and images. Specifically, our solution combines two parts in a principled approach: First, we transform images from a image feature space to a word space utilizing the knowledge from images with annotations from social Web. Then, a language model based approach is applied to estimate the relevance between transformed images and advertisements. Moreover, in this model, the probability of recommending an advertisement can be inferred efficiently given an image, which enables potential applications to online advertising.
Word Space
Representations for semantic information about words are necessary for many applications of neural networks in natural language processing. This paper describes an efficient, corpus-based method for inducing distributed semantic representations for a large number of words (50,000) from lexical coccurrence statistics by means of a large-scale linear regression. The representations are successfully applied to word sense disambiguation using a nearest neighbor method. 1 Introduction Many tasks in natural language processing require access to semantic information about lexical items and text segments.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > San Mateo County > San Mateo (0.04)
- (3 more...)
Word Space
Representations for semantic information about words are necessary for many applications of neural networks in natural language processing. This paper describes an efficient, corpus-based method for inducing distributed semantic representations for a large number of words (50,000) from lexical coccurrence statistics by means of a large-scale linear regression. The representations are successfully applied to word sense disambiguation using a nearest neighbor method. 1 Introduction Many tasks in natural language processing require access to semantic information about lexical items and text segments.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > San Mateo County > San Mateo (0.04)
- (3 more...)
Word Space
Representations for semantic information about words are necessary formany applications of neural networks in natural language processing. This paper describes an efficient, corpus-based method for inducing distributed semantic representations for a large number ofwords (50,000) from lexical coccurrence statistics by means of a large-scale linear regression. The representations are successfully appliedto word sense disambiguation using a nearest neighbor method. 1 Introduction Many tasks in natural language processing require access to semantic information about lexical items and text segments.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > San Mateo County > San Mateo (0.04)
- (3 more...)